Abstract: Timing and cache side channels provide powerful 
attacks against many sensitive operations including cryptographic 
implementations. Existing defenses cannot protect against all 
classes of such attacks without incurring prohibitive performance 
overhead. A popular strategy for defending against all classes 
of these attacks is to modify the implementation so that the 
timing and cache access patterns of every hardware instruction 
are independent of the secret inputs. However, this solution is 
architecture-specific, brittle, and difficult to get right. In this 
paper, we propose and evaluate a robust low-overhead technique 
for mitigating timing and cache channels. Our solution requires 
only minimal source code changes and works across multiple 
languages/platforms. We report the experimental results of applying 
our solution to protect several C, C++, and Java programs. Our 
results demonstrate that our solution successfully eliminates the 
timing and cache side-channel leaks while incurring significantly 
lower performance overhead than existing approaches. 


I. Introduction 

Defending against cache and timing side channel attacks 
is known to be a hard and important problem. Timing and 
cache attacks can be used to extract cryptographic secrets 
from running systems [14] [15] [29] [40], spy 
on Web user activity, and even undo the privacy of 
differential privacy systems. Attacks exploiting timing 
side channels have been demonstrated for both remote and 
local adversaries. A remote attacker is separated from its target 
by a network [15] [29] [36], while a local attacker can execute 
unprivileged spyware on the target machine [7] [9] [11]. 

Most existing defenses against cache and timing attacks 
only protect against a subset of attacks and incur significant 
performance overheads. For example, one way to defend 
against remote timing attacks is to make sure that the timing of 
any externally observable event is independent of any data 
that should be kept secret. Several different strategies have 
been proposed to achieve this, including application-specific 
changes [30], static transformation, and 
dynamic padding [24] [31]. However, none of these 
strategies defend against local timing attacks where the attacker 
spies on the target application by measuring the target’s impact 
on the local cache and other resources. Similarly, the strategies 
for defending against local cache attacks, like static partitioning 
of resources, flushing state, obfuscating 
cache access patterns, and moderating 
access to fine-grained timers, also incur significant 
performance penalties while still leaving the target potentially 
vulnerable to timing attacks. We survey these methods in 
related work (Section VIII). 


A popular approach for defending against both local and 
remote timing attacks is to ensure that the low-level instruction 
sequence does not contain instructions whose performance 
depends on secret information. This can be enforced by 
manually re-writing the code, as was done in OpenSSL¹, or by 
changing the compiler to ensure that the generated code has 
this property [20]. 

Unfortunately, this popular strategy can fail to ensure 
security for several reasons. First, the timing properties of 
instructions may differ in subtle ways from one architecture 
to another (or even from one processor model to another) 
resulting in an instruction sequence that is unsafe for some 
architectures/processor models. Second, this strategy does not 
work for languages like Java where the Java Virtual Machine 
(JVM) optimizes the bytecode at runtime and may inadvertently 
introduce secret-dependent timing variations. Third, 
manually ensuring that a certain code transformation prevents 
timing attacks can be extremely difficult and tedious, as was 
the case when updating OpenSSL to prevent the Lucky Thirteen 
timing attack [32]. 

Our contribution. We propose the first low-overhead, 
application-independent, and cross-language defense that can 
protect against both local and remote timing attacks with 
minimal application code changes. We show that our defense 
is language-independent by applying the strategy to protect 
applications written in Java and C/C++. Our defense requires 
relatively simple modifications to the underlying OS and can 
run on off-the-shelf hardware. 

We implement our approach in Linux and show that the 
execution times of protected functions are independent of 
secret data. We also demonstrate that the performance overhead 
of our defense is low. For example, the performance overhead 
to protect the entire state machine running inside an SSL/TLS 
server against all known timing- and cache-based side channel 
attacks is less than 5% in connection latency. 

We summarize the key insights behind our solution (described 
in detail in Section IV) below. 

• We leverage programmer code annotations to identify 
and protect sensitive code that operates on secret data. 
Our defense mechanism only protects the sensitive functions. 
This lets us minimize the performance impact of 
our scheme by leaving the performance of non-sensitive 
functions unchanged. 


¹ In the case of RSA private key operations, OpenSSL uses an additional 
defense called blinding. 





• We further minimize the performance overhead by separating 
and accurately accounting for secret-dependent and 
secret-independent timing variations. Secret-independent 
timing variations (e.g., the ones caused by interrupts, the 
OS scheduler, or non-secret execution flow) do not leak 
any sensitive information to the attacker and thus are 
treated differently than secret-dependent variations by our 
scheme. 

• We demonstrate that existing OS services like schedulers 
and hardware features like memory hierarchies can be 
leveraged to create a lightweight isolation mechanism that 
can protect a sensitive function’s execution from other 
local untrusted processes and minimize timing variations 
during the function’s execution. 

• We show that naive implementations of delay loops in 
most existing hardware leak timing information due to 
the underlying delay primitive’s (e.g., NOP instruction) 
limited accuracy. We create and evaluate a new scheme 
for implementing delay loops that prevents such leakage 
while still using existing coarse-grained delay primitives. 

• We design and evaluate a lazy state cleansing mechanism 
that clears the sensitive state left in shared hardware 
resources (e.g., branch predictors, caches, etc.) before 
handing them over to an untrusted process. We find that 
lazy state cleansing incurs significantly less overhead than 
performing state cleaning as soon as a sensitive function 
finishes execution. 

II. Known timing attacks 

Before describing our proposed defense we briefly survey 
different types of timing attackers. In the previous section, we 
discussed the difference between a local and a remote timing 
attacker: a local timing attacker, in addition to monitoring the 
total computation time, can spy on the target application by 
monitoring the state of shared hardware resources such as the 
local cache. 

Concurrent vs. non-concurrent attacks. In a concurrent 
attack, the attacker can probe shared resources while the target 
application is operating. For example, the attacker can measure 
timing information or inspect the state of the shared resources 
at intermediate steps of a sensitive operation. The attacker’s 
process can control the concurrent access by adjusting its 
scheduling parameters and its core affinity in the case of 
symmetric multiprocessing (SMP). 

A non-concurrent attack is one in which the attacker only 
gets to observe the timing information or shared hardware state 
at the beginning and the end of the sensitive computation. 
For example, a non-concurrent attacker can extract secret 
information using only the aggregate time it takes the target 
application to process a request. 

Local attacks. Concurrent local attacks are the most prevalent 
class of timing attacks in the research literature. Such attacks 
are known to be able to extract the secret/private key from 
a wide range of ciphers including RSA [4] [36], AES 
[40] [46], and ElGamal [49]. These attacks exploit information 
leakage through a wide range of shared hardware resources: L1 
or L2 data cache [15] [40], L3 cache [26] [46], instruction 
cache, branch predictor cache, and floating-point 
multiplier. 


There are several known local non-concurrent attacks as 
well. Osvik et al., Tromer et al., and Bonneau 
and Mironov present two types of local, non-concurrent 
attacks against AES implementations. In the first, prime and 
probe, the attacker “primes” the cache, triggers an AES encryption, 
and “probes” the cache to learn information about the 
AES key. The spy process primes the cache by loading 
its own memory content into the cache and probes the cache by 
measuring the time to reload the memory content after the AES 
encryption has completed. This attack involves the attacker’s 
spy process measuring its own timing information to indirectly 
extract information from the victim application. Alternatively, 
in the evict and time strategy, the attacker measures the time 
taken to perform the victim operation, evicts certain chosen 
cache lines, triggers the victim operation, and measures its 
execution time again. By comparing these two execution times, 
the attacker can find out which cache lines were accessed 
during the victim operation. Osvik et al. were able to extract 
a 128-bit AES key after only 8,000 encryptions using the 
prime and probe attack. 
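
To make the prime and probe idea concrete, the following is a minimal sketch on a toy direct-mapped cache. The cache model, the set count, and all function names are illustrative assumptions; a real attack infers evictions from reload timing rather than by inspecting cache contents.

```python
# Toy model (assumption): a direct-mapped cache with NUM_SETS sets,
# one line per set, tracked by the name of the line's owner.
NUM_SETS = 16


def victim_access(cache, secret_index):
    """Victim loads a table entry whose cache set depends on the secret."""
    cache[secret_index % NUM_SETS] = "victim"


def prime_and_probe(secret_index):
    """Return the cache sets the spy would find evicted after one victim run."""
    # Prime: the spy fills every cache set with its own data.
    cache = ["spy"] * NUM_SETS
    # The victim runs once (e.g., one encryption).
    victim_access(cache, secret_index)
    # Probe: a set that no longer holds spy data would be slow to reload,
    # which tells the spy that the victim touched that set.
    return [s for s in range(NUM_SETS) if cache[s] != "spy"]
```

In this model the spy learns the secret index modulo the number of sets from a single run, which is the kind of partial information the real attacks accumulate over many encryptions.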

Remote attacks. All existing remote attacks [14] [15] [36] 
are non-concurrent; however, this is not fundamental. A hypothetical 
remote, yet concurrent, attack would be one in 
which the remote attacker submits requests to the victim 
application at the same time that another non-adversarial client 
sends some requests containing sensitive information to the 
victim application. The attacker may then be able to measure 
timing information at intermediate steps of the non-adversarial 
client’s communication with the victim application and infer 
the sensitive content. 

III. Threat Model 

We allow the attacker to be local or remote and to execute 
concurrently or non-concurrently with the target application. 
We assume that the attacker can only run spy processes as 
a different non-privileged user (i.e., no super-user privileges) 
than the owner of the target application. We also assume 
that the spy process cannot bypass the standard user-based 
isolation provided by the operating system. We believe that 
these are very realistic assumptions because if either one of 
these assumptions fails, the spy process can steal the user’s 
sensitive information without resorting to side channel attacks 
in most existing operating systems. 

In our model, the operating system and the underlying 
hardware are trusted. Similarly, we expect that the attacker 
does not have physical access to the hardware and cannot 
monitor side channels such as electromagnetic radiation, 
power use, or acoustic emanations. We are only concerned 
with timing and cache side channels since they are the easiest 
side channels to exploit without physical access to the victim 
machine. 

IV. Our Solution 

In our solution, developers annotate the functions performing 
sensitive computation(s) that they would like to protect. 
For the rest of the paper, we refer to such functions as 
protected functions. Our solution instruments the protected 
functions such that our stub code is invoked before and after 
execution of each protected function. The stub code ensures 
that the protected functions, all other functions that may be 
invoked as part of their execution, and all the secrets that they 
operate on are safe from both local and remote timing attacks. 
Thus, our solution automatically prevents leakage of sensitive 
information by all functions (protected or unprotected) invoked 
during a protected function’s execution. 

Our solution ensures the following properties for each 
protected function: 

• We ensure that the execution time of a protected function 
as observed by either a remote or a local attacker is 
independent of any secret data the function operates on. 
This prevents an attacker from learning any sensitive information 
by observing the execution time of a protected 
function. 

• We clean any state left in the shared hardware resources 
(e.g., caches) by a protected function before handing 
the resources over to an untrusted process. As described 
earlier in our threat model (Section III), we treat any 
process as untrusted unless it belongs to the same user 
who is performing the protected computation. We cleanse 
shared state only when necessary in a lazy manner to 
minimize the performance overhead. 

• We prevent other concurrent untrusted processes from accessing 
any intermediate state left in the shared hardware 
resources during the protected function’s execution. We 
achieve this by efficiently and dynamically partitioning the shared 
resources while incurring minimal performance overhead. 
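
The lazy cleansing rule in the second property can be sketched as follows. This is a minimal model, assuming a hypothetical per-core record of which user last ran a protected function; the class, its fields, and the flush counter are illustrative stand-ins, not the kernel implementation.

```python
class Core:
    """Hypothetical per-core bookkeeping for lazy state cleansing."""

    def __init__(self):
        self.tainted_by = None  # uid whose protected function left state here
        self.flushes = 0        # stand-in for flushing caches and predictors

    def run_protected(self, uid):
        # A protected function leaves secret-dependent state in per-core
        # resources (e.g., caches).
        self.tainted_by = uid

    def switch_to(self, uid):
        # Cleanse only when the incoming process belongs to a different
        # user, i.e., only when the state would reach an untrusted process.
        if self.tainted_by is not None and uid != self.tainted_by:
            self.flushes += 1
            self.tainted_by = None
```

Under this rule, switching between processes of the same user never triggers a flush, which is where the savings over eager cleansing come from.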


[Figure 1: a protected function runs on one core while untrusted 
processes run on cores 2 and 3; each core has its own L1 and L2 cache. 
Annotations: no user process can preempt protected functions; apply 
padding to make timing secret-independent; lazily clean per-core 
resources; per-user page coloring isolates the protected function's 
cache lines.] 

Fig. 1: Overview of our solution 


Figure 1 shows the main components of our solution. 
We use two high-level mechanisms to provide the properties 
described above for each protected function: time padding and 
preventing leakage through shared resources. We first briefly 
summarize these mechanisms below and then describe them 
in detail in Sections IV-A and IV-B. 

Time padding. We use time padding to make sure that 
a protected function’s execution time does not depend on 
the secret data. The basic idea behind time padding is simple: 
pad the protected function’s execution time to its worst-case 
runtime over all possible inputs. The idea of padding 
execution time to an upper limit to prevent timing channels 
is itself not new and has been explored in several prior 
projects. However, all these solutions 
suffer from two major problems which prevent them from 
being adopted in real-world settings: i) they incur prohibitive 
performance overhead (90-400% in macro-benchmarks) 
because they have to add a large amount of time padding in 
order to prevent any timing information leakage to a remote 
attacker, and ii) they do not protect against local adversaries 
who can infer the actual unpadded execution time through side 
channels beyond network events (e.g., by monitoring the cache 
access patterns at periodic intervals). 

We solve both of these problems in this paper. One of our 
main contributions is a new low-overhead time padding scheme 
that can prevent timing information leakage of a protected 
function to both local and remote attackers. We minimize 
the required time padding without compromising security by 
adapting the worst-case time estimates using the following 
three principles: 

1) We adapt the worst-case execution estimates to the target 
hardware and the protected function. We do so by providing 
an offline profiling tool to automatically estimate the 
worst-case runtime of a particular protected function 
running on a particular target hardware platform. Prior 
schemes estimate the worst-case execution times for 
complete services (e.g., web servers) across all possible 
hardware configurations. This results in an over-estimate 
of the time pad that hurts performance. 

2) We protect against local (and remote) attackers by ensuring 
that an untrusted process cannot intervene during a 
protected function’s execution. We apply time padding at 
the end of every protected function’s execution. This ensures 
minimal overhead while preventing a local attacker 
from learning the running time of protected functions. 
Prior schemes applied a large time pad before sending a 
service’s output over the network. Such schemes are not 
secure against local attackers who can use local resources, 
such as cache behavior, to infer the execution time of 
individual protected functions. 

3) Timing variations result from many factors. Some are 
secret-dependent and must be prevented, while others 
are secret independent and cause no harm. For example, 
timing variations due to the OS scheduler and interrupt 
handlers are generally harmless. We accurately measure 
and account for secret-dependent variations and ignore 
the secret-independent variations. This lets us compute an 
optimal time pad needed to protect secret data. None of 
the existing time padding schemes distinguish between 
the secret-dependent and secret-independent variations. 
This results in unnecessarily large time pads, even when 
secret-dependent timing variations are small. 

Preventing leaks via shared resources. We prevent information 
leakage through shared resources without adding 
significant performance overhead to the process executing the 
protected function or to other (potentially malicious) processes. 
Our approach is as follows: 

• We leverage the multi-core processor architecture found 
in most modern processors to minimize the amount of 
shared resources during a protected function’s execution 
without hurting performance. We dynamically reserve 
exclusive access to a physical core (including all per-core 
caches such as L1 and L2) while it is executing 
a protected function. This ensures that a local attacker 
does not have concurrent access to any per-core resources 
while a protected function is accessing them. 

• For L3 caches shared across multiple cores, we use page 
coloring to ensure that cache accesses during a protected 
function’s execution are restricted within a reserved portion 
of the L3 cache. We further ensure that this reserved 
portion is not shared with other users’ processes. This 
prevents the attacker from learning any information about 
protected functions through the L3 cache. 

• We lazily cleanse the state left in both per-core resources 
(e.g., L1/L2 caches, branch predictors) and resources 
shared across cores (e.g., L3 cache) only before handing 
them over to untrusted processes. This minimizes the 
overhead caused by the state cleansing operation. 
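
As background for the page coloring step, the sketch below computes which cache "color" a physical page maps to, assuming a physically indexed set-associative cache with a simple modulo index. The cache size, associativity, and page size are illustrative parameters, and real L3 caches may use additional address hashing that this model ignores.

```python
def num_colors(cache_bytes, ways, page_bytes=4096):
    """Number of page colors: pages that can coexist in distinct cache regions."""
    return cache_bytes // (ways * page_bytes)


def page_color(phys_addr, cache_bytes, ways, page_bytes=4096):
    """Color of the physical page containing phys_addr (simple modulo indexing)."""
    page_number = phys_addr // page_bytes
    return page_number % num_colors(cache_bytes, ways, page_bytes)
```

For example, under these assumptions an 8 MiB, 16-way cache with 4 KiB pages has 128 colors. Reserving a set of colors for one user's pages keeps that user's cache lines out of the regions other users' pages can occupy.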


A. Time padding 

We design a safe time padding scheme that defends against 
both local and remote attackers inferring sensitive information 
from observed timing behavior of a protected function. Our design 
consists of two main components: estimating the padding 
threshold and applying the padding safely without leaking any 
information. We describe these components in detail next. 


Determining the padding value. Our time padding only 
accounts for secret-dependent time variations. We discard 
variations due to interrupts or OS scheduler preemptions. To 
do so we rely on Linux’s ability to keep track of the number of 
external preemptions. We adapt the total padding time based 
on the amount of time that a protected function is preempted 
by the OS. 

• Let T_max be the worst-case execution time of a protected 
function when no external preemptions occur. 

• Let T_ext_preempt be the worst-case time spent during preemptions 
given the set of n preemptions that occur during 
the execution of the protected function. 

Our padding mechanism pads the execution of each protected 
function to T_padded cycles, where 

    T_padded = T_max + T_ext_preempt 

This leaks the amount of preemption time to the attacker, 
but nothing else. Since this is independent of the secret, the 
attacker learns nothing useful. 


Estimating T_max. Our time padding scheme requires a tight 
estimate of the worst-case execution time (WCET) of every 
protected function. There are several prior projects that try 
to estimate WCET through different static analysis techniques. 
However, these techniques require precise and 
accurate models of the target hardware (e.g., cache, branch 
target buffers, etc.) which are often very hard to get in practice. 
In our implementation we use a simple dynamic profiling 
method to estimate WCET, described below. Our time padding 
scheme is not tied to any particular WCET estimation method 
and can work with other estimation tools. 

We estimate the WCET, T_max, through dynamic offline profiling 
of the protected function. Since this value is hardware-specific, 
we perform the profiling on the actual hardware 
that will run protected functions. To gather profiling information, 
we run an application that invokes protected functions 
with an input generating script provided by the application 
developer/system administrator. To reduce the possibility of 
overtimes occurring due to uncommon inputs, it is important 
that the script generate both common and uncommon inputs. 
We instrument the protected functions in the application so 
that the worst-case performance behavior is stored in a profile 
file. We compute the padding parameters based on the profiling 
results. 

To be conservative, we obtain all profiling measurements 
for the protected functions under high load conditions (i.e., in 
parallel with other applications that produce significant loads 
on both memory and CPU). We compute T_max from these 
measurements such that it is the worst-case timing bound when 
at most a κ fraction of all profiling readings are excluded. κ is a 
security parameter which provides a tradeoff between security 
and performance. Higher values of κ reduce T_max but increase 
the chance of overtimes. For our prototype implementation we 
set κ to 10“®. 
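
One way to read the κ-based rule above is as a high quantile of the profiled timings. The sketch below is an illustration under that reading; the function name and the floor-based rounding are assumptions, not the profiling tool's actual code.

```python
import math


def estimate_t_max(samples, kappa):
    """Worst-case bound after excluding at most a kappa fraction of readings."""
    ordered = sorted(samples)
    # Number of the largest readings we are allowed to exclude.
    allowed = math.floor(kappa * len(ordered))
    return ordered[len(ordered) - 1 - allowed]
```

With kappa = 0 this returns the maximum observed time. Larger values drop the most extreme readings, lowering the bound at the cost of more frequent overtimes.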

Safely applying padding. Once the padding amount has been 
determined using the techniques described earlier, waiting for 
the target amount might seem easy at first glance. However, 
there are two major issues that make application of padding 
complicated in practice as described below. 

Handling limited accuracy of padding loops. As our solution 
depends on fine-grained padding, a naive padding scheme may 
leak information due to limited accuracy of any padding loops. 
Figure 2 shows that a naive padding scheme that repeatedly 
measures the elapsed time in a tight loop until the target time 
is reached leaks timing information. This is because the loop 
can only break when the condition is evaluated, and hence 
if one iteration of the loop takes u cycles then the padding 
loop leaks timing information mod u. Note that earlier time 
padding schemes are not affected by this problem as their 
padding amounts are significantly larger than ours. 
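
The leak can be reproduced in a small simulation. This is a sketch with an idealized loop in which every iteration costs exactly u cycles and the clock is only checked once per iteration; the function name and the cycle counts are illustrative.

```python
def naive_pad_exit(start, target, u):
    """Exit time of a wait loop that can only check the clock every u cycles."""
    now = start
    while now < target:
        now += u  # one loop iteration; the exit condition is only tested here
    return now
```

Because the loop always adds whole multiples of u, the exit time is congruent to the start time modulo u. Two runs that enter the loop at 900 and 903 cycles with target 1000 and u = 7 exit at 1005 and 1001 respectively, so the observed overshoot reveals the unpadded time modulo u.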

Our solution guarantees that the distribution of running 
times of a protected function for some set of private inputs 
is indistinguishable from the same distribution produced when 
a different set of private inputs to the function is used. We 
call this property the safe padding property. We overcome 
the limitations of the simple wait loop by performing a 
timing randomization step before entering the simple wait 
loop. During this step, we perform m rounds of a randomized 
waiting operation. The goal of this step is to ensure that the 
amount of time spent in the protected function before the 
beginning of the simple wait loop, when taken modulo u, the 
stable period of the simple timing loop (i.e., disregarding the 
first few iterations), is close to uniform. This technique can be 
viewed as performing a random walk on the integers modulo u 
where the runtime distribution of the waiting operation is the 
support of the walk and m is the number of steps walked. Prior 
work by Chung et al. has explored the sufficient conditions 
for the number of steps in a walk and its support that produce 
a distribution that is exponentially close to uniform. 

For the purposes of this paper, we perform timing randomization 
using a randomized operation with 256 possible inputs 
that runs for X + c cycles on input X where c is a constant. 
We describe the details of this operation in Section V. We 
then choose m to defeat our empirical statistical tests under 
pathological conditions that are very favorable to an attacker 
as shown in Section m 
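
The effect of the number of rounds m can be explored numerically. The sketch below computes, exactly, the distribution of the total randomized delay modulo u when each round waits X + c cycles for X drawn uniformly from 256 inputs, and measures its distance from uniform. The choice of u, the default c, and the function names are illustrative assumptions.

```python
from fractions import Fraction


def delay_sum_mod_u(m, u, c=0, inputs=256):
    """Exact distribution of (sum of m delays) mod u, each delay being X + c."""
    dist = [Fraction(0)] * u
    dist[0] = Fraction(1)
    for _ in range(m):
        nxt = [Fraction(0)] * u
        for r, p in enumerate(dist):
            if p:
                for x in range(inputs):
                    nxt[(r + x + c) % u] += p / inputs
        dist = nxt
    return dist


def distance_from_uniform(dist):
    """Total variation distance between dist and the uniform distribution."""
    u = len(dist)
    return sum(abs(p - Fraction(1, u)) for p in dist) / 2
```

For u = 100, a single round is visibly non-uniform because 256 is not a multiple of 100, and additional rounds shrink the distance quickly. This matches the random-walk view of the randomization step.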

For our scheme’s guarantees to hold, the randomness used 
inside the randomized waiting operation must be generated 
using a cryptographically secure generator. Otherwise, if an 
attacker can predict the added random noise, she can subtract 
it from the observed padded time and hence derive the original 
timing signal, modulo u. 

A padding scheme that pads to the target time T_padded 
using a simple padding loop and performs the randomization 
step after the execution of the protected function will not 
leak any information about the duration of the protected 
function, as long as the following conditions hold: (i) no 
preemptions occur; (ii) the randomization step successfully 
yields a distribution of runtimes that is uniform modulo u; 
(iii) the simple padding loop executes for enough iterations 
so that it reaches its stable period. The security of this scheme 
under these assumptions can be proved as follows. 

Let us assume that the last iteration of the simple wait 
loop takes u cycles. Assuming the simple wait loop has iterated 
enough times to reach its stable period, we can safely assume 
that u does not depend on when the simple wait loop started 
running. Now, due to the randomization step, we assume that 
the amount of time spent up to the start of the last iteration of 
the simple wait loop, taken modulo u, is uniformly distributed. 
Hence, the loop will break at a time that is between the 
target time and the target time plus u - 1. Because the last 
iteration began when the elapsed execution time was uniformly 
distributed modulo u, these u different cases will occur with 
equal probability. Hence, regardless of what is done within 
the protected function, the padded duration of the function 
will follow a uniform distribution of u different values after 
the target time. Therefore, the attacker will not learn anything 
from observing the padded time of the function. 
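
The argument can be checked by enumeration in an idealized model. The sketch below assumes the randomization step yields an offset that is exactly uniform on 0..u-1 and that every loop iteration costs exactly u cycles; both are simplifications, and the function name is illustrative.

```python
from collections import Counter


def padded_overshoots(start, target, u):
    """Count each overshoot past the target over all u equally likely offsets."""
    overshoots = Counter()
    for offset in range(u):      # idealized, exactly uniform randomization
        now = start + offset
        while now < target:
            now += u             # stable-period loop iteration
        overshoots[now - target] += 1
    return overshoots
```

For any start time that is at least u cycles before the target, every overshoot from 0 to u - 1 occurs exactly once, so the resulting distribution is the same regardless of how long the protected function actually ran.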

To reduce the worst-case performance cost of the randomization 
step, we generate the required randomness at the start 
of the protected function, before measuring the start time of 
the protected function. This means that any variability in the 
runtime of the randomness generator does not increase T_padded. 

// At the return point of a protected function: 
// T_begin holds the time at function start 
// I_begin holds the preemption count at function start 
for j = 1 to m 
    Short-Random-Delay() 
T_target = T_begin + T_max 
overtime = 0 
for i = 1 to ∞ 
    before = Current-Time() 
    while Current-Time() < T_target, re-check. 
    // Measure preemption count and adjust target 
    T_ext_preempt = (Preemptions() - I_begin) · T_penalty 
    T_next = T_begin + T_max + T_ext_preempt + overtime 
    // Overtime-detection support 
    if before > T_next and overtime = 0 
        overtime = 
        T_next = T_next + overtime 
    // If no adjustment was made, break 
    if T_next = T_target 
        return 
    T_target = T_next 

Fig. 3: Algorithm for applying time padding to a protected function’s 
execution. 
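
The algorithm of Fig. 3 can be exercised with a simulated clock. The following is a sketch, not the kernel implementation: the clock and preemption counter are caller-supplied hooks, SimClock is a test helper, and t_overtime (the amount added to the target when an overtime is detected) is an assumed parameter introduced here for illustration.

```python
class SimClock:
    """Simulated cycle counter: every read advances time by one cycle."""

    def __init__(self, start=0):
        self.now = start

    def __call__(self):
        self.now += 1
        return self.now


def apply_padding(clock, preemptions, t_begin, i_begin, t_max, t_penalty,
                  t_overtime, m=0, short_random_delay=lambda: None):
    """Pad until the (possibly adjusted) target; return the final target."""
    for _ in range(m):
        short_random_delay()      # must use a CSPRNG in a real deployment
    t_target = t_begin + t_max
    overtime = 0
    while True:
        before = clock()
        while clock() < t_target:
            pass                  # re-check the clock
        # Measure the preemption count and adjust the target.
        t_ext_preempt = (preemptions() - i_begin) * t_penalty
        t_next = t_begin + t_max + t_ext_preempt + overtime
        # Overtime detection: we were already past the target on entry.
        if before > t_next and overtime == 0:
            overtime = t_overtime  # assumed parameter, see lead-in
            t_next += overtime
        if t_next == t_target:     # no adjustment was made
            return t_target
        t_target = t_next
```

With no preemptions the function pads to t_begin + t_max. If two preemptions are observed and t_penalty is 10, the target moves out by 20 cycles, independent of when the protected function actually finished.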


Handling preemptions occurring inside the padding loop. 
The scheme presented above assumes that no external preemptions 
can occur during the execution of the padding 
loop itself. However, blocking all preemptions during the 
padding loop will degrade the responsiveness of the system. To 
avoid such issues, we allow interrupts to be processed during 
the execution of the padding loop and update the padding 
time accordingly. We repeatedly update the padding time in 
response to preemptions until a “safe exit condition” is met 
where we can stop padding. 

Our approach is to initially pad to the target value T_padded, 
regardless of how many preemptions occur. We then repeatedly 
increase T_ext_preempt and pad to the new adjusted padding target 
until we execute a padding loop where no preemptions occur. 
The pseudocode of our approach is shown in Figure 3. Our 
technique does not leak any information about the actual 
runtime of the protected function as the final padding target 
only depends on the pattern of preemptions but not on the 
initial elapsed time before entering the padding loops. Note 
that forward progress in our padding loops is guaranteed as 
long as preemptions are rate limited on the cores executing 
protected functions. 

The algorithm computes T_ext_preempt based on observed 
preemptions simply by multiplying a constant T_penalty by the 
number of preemptions. Since T_ext_preempt should match the 
worst-case execution time of the observed preemptions, T_penalty 
is the worst-case execution time of any single preemption. 
Like T_max, T_penalty is machine specific and can be determined 
empirically from profiling data. 

Handling overtimes. Our WCET estimator may miss a 
pathological input that causes the protected function to run for 
significantly more time than on other inputs. While we never 
observed this in our experiments, if such a pathological input 
appeared in the wild, the protected function may take longer 
than the estimated worst-case bound and this will result in 
an overtime. This leaks information: the attacker learns that a 
pathological input was just processed. We therefore augment 
our technique to detect such overtimes, i.e., when the elapsed 
time of the protected function, taking interrupts into account, 
is greater than T_padded. 

One option to limit leakage when such overtimes are 
detected is to refuse to service such requests. The system 
administrator can then act by either updating the secrets (e.g., 
secret keys) or increasing the parameter T_max of the model. 

We also support updating T_max of a protected function 
on the fly without restarting the running application. The 
padding parameters are stored in a file that has the same 
access permissions as the application/library containing the 
protected function. This file is memory-mapped when the 
corresponding protected function is called for the first time. 
Any changes to the memory-mapped file will immediately 
impact the padding parameters of all applications invoking the 
protected function unless they are in the middle of applying 
the estimated padding. 

Note that each overtime can leak at most log(N) bits of 
information, where N is the total number of timing measurements 
observed by the attacker. To see why, consider a string 
of N timing observations made by an attacker with at most 
B overtimes. There can be < N^B such unique strings and 
thus the maximum information content of such a string is 
< B log(N) bits, i.e., < log(N) bits per overtime. However, 
the actual effect of such leakage depends on how much entropy 
an application’s timing patterns for different inputs have. For 
example, if an application’s execution time for a particular 
secret input is significantly larger than all other inputs, even 
leaking 1 bit of information will be enough for the attacker to 
infer the complete secret input. 
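
The counting argument can be checked for concrete values. The sketch below counts observation strings of length N with at most B overtimes and compares the resulting information content with the B log(N) bound, taking logarithms base 2. The function names and the sample values N = 1000, B = 3 are illustrative.

```python
import math


def count_strings(n, b):
    """Number of length-n observation strings containing at most b overtimes."""
    return sum(math.comb(n, i) for i in range(b + 1))


def leak_bound_bits(n, b):
    """Upper bound of b * log2(n) bits on the leaked information."""
    return b * math.log2(n)
```

For N = 1000 and B = 3 there are 166,667,501 such strings, fewer than 1000^3, so the information content stays below the bound of about 29.9 bits, i.e., under 10 bits per overtime.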

Minimizing external preemptions. Note that even though 
T_padded does not leak any sensitive information, padding to this 
value will incur significant performance overhead if T_ext_preempt 
is high due to frequent or long-running preemptions during 
the protected function’s execution. Therefore, we minimize the 
external events that can delay the execution of a protected 
function. We describe the main external sources of delays and 
how we deal with them in detail below. 

• Preemptions by other user processes. Under regular
circumstances, execution of a protected function may
be preempted by other user processes. This can delay
the execution of the protected function for as long as the
process is preempted. Therefore, we need to minimize
such preemptions while still keeping the system usable.
In our solution, we prevent preemptions by other user
processes during the execution of a protected function
by using a scheduling policy that prevents migrating
the process to a different core and prevents other user
processes from being scheduled on the same core for
the duration of the protected function’s execution.

• Preemptions by interrupts. Another common source
of preemption is the hardware interrupts served by the
core executing a protected function. One way to solve
this problem is to block or rate-limit the
interrupts that can be served by a core while it executes a
protected function. However, such a technique may make
the system unresponsive under heavy load. For this
reason, our current prototype does not apply
such techniques.

Note that some of these interrupts (e.g., network interrupts)
can be triggered by the attacker and thus can be
used by the attacker to slow down the protected function’s
execution. However, in our solution, such an attack increases
T_ext_preempt, and hence degrades performance, but
does not cause information leakage.

• Paging. An attacker can potentially arbitrarily slow down 
the protected function by causing memory paging events 
during the execution of a protected function. To avoid 
such cases, our solution forces each process executing a 
protected function to lock all of its pages in memory and 
disables page swapping. As a consequence, our solution 
currently does not allow processes that allocate more 
memory than is physically available in the target system 
to use protected functions. 

• Hyperthreading. Hyperthreading is a technique supported
by modern processor cores where one physical
core supports multiple logical cores. The operating system
can independently schedule tasks on these logical cores
and the hardware transparently takes care of sharing the
underlying physical core. We observed that protected
functions executing on a core with hyperthreading enabled
can encounter large slowdowns. This slowdown
occurs because other processes executing concurrently
on the same physical core contend for
some of the CPU resources.

One potential way of avoiding this slowdown is to configure
the OS scheduler to prevent any untrusted process
from running concurrently on a physical core with a
process in the middle of a protected function. However,
such a mechanism may result in high overheads due
to the cost of actively unscheduling/migrating a process
running on a logical core. For our current prototype
implementation, we simply disable hyperthreading as part
of system configuration.

• CPU frequency scaling. Modern CPUs include mechanisms
to change the operating frequency of each core
dynamically at runtime depending on the current workload
to save power. If a core’s frequency decreases in
the middle of the execution of a protected function, or the
core enters the halt state to save power, the function will take
longer in real time, increasing T_max. To reduce such variations, we
disable CPU frequency scaling and low-power CPU states
when a core executes a protected function.

B. Preventing leakage through shared resources 

We prevent information leakage from protected functions
through shared resources in two ways: isolating shared resources
from other concurrent processes and lazily cleansing
state left in shared resources before handing them over to other
untrusted processes. Isolating shared resources of protected
functions from other concurrent processes helps in preventing
local timing and cache attacks as well as improving performance
by minimizing variations in the runtime of protected
functions.

Isolating per-core resources. As described earlier in Section
IV-A, we disable hyperthreading on a core during a
protected function’s execution to improve performance. This
also ensures that an attacker cannot run spy code that snoops
on per-core state while a protected function is executing. We
also prevent preemptions from other user processes during the
execution of a protected function and thus ensure that the core
and its L1/L2 caches are dedicated to the protected function.

Preventing leakage through performance counters. Modern
hardware often contains performance counters that keep track of
different performance events such as the number of cache evictions
or branch mispredictions occurring on a particular core.
A local attacker with access to these performance counters may
infer the secrets used during a protected function’s execution.
Our solution, therefore, restricts access to performance monitoring
counters so that a user’s process cannot see detailed
performance metrics of another user’s processes. We do not
restrict, however, a user from using hardware performance
counters to measure the performance of their own processes.

Preventing leakage through L3 cache. Because the L3 cache is a
resource shared across multiple cores, we use page coloring
to dynamically isolate the protected function’s data in the L3
cache. To support page coloring, we modify the OS kernel’s
physical page allocators so that they do not allocate pages
having any of C reserved secure page colors, unless the caller
specifically requests a secure color. Pages are colored based
on which L3 cache sets a page maps to. Therefore, two pages
with different colors are guaranteed never to conflict in the L3
cache in any of their cache lines.
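
The relationship between page colors and cache sets can be sketched with a simplified model. The cache geometry used below (8 MiB, 16-way, 64-byte lines, 4 KiB pages) is a hypothetical example rather than the geometry of any particular machine, and the model assumes the cache is indexed directly by physical address bits; some processors additionally hash addresses across cache slices, which this sketch ignores.

```python
def num_page_colors(cache_size, associativity, page_size=4096):
    """Number of page colors of a physically indexed set-associative cache."""
    way_size = cache_size // associativity  # bytes covered by one cache way
    return way_size // page_size


def page_color(phys_addr, colors, page_size=4096):
    """Color of the physical page containing phys_addr."""
    return (phys_addr // page_size) % colors


def cache_sets_of_page(phys_addr, cache_size, associativity,
                       line_size=64, page_size=4096):
    """Set of cache-set indices that the lines of one physical page map to."""
    num_sets = cache_size // (associativity * line_size)
    base = phys_addr - (phys_addr % page_size)
    return {((base + offset) // line_size) % num_sets
            for offset in range(0, page_size, line_size)}
```

In this model, two pages with different colors map to disjoint groups of cache sets, while two pages with the same color map to exactly the same sets.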

In order to support page coloring, we disable transparent 
huge pages and set up access control to huge pages. An 
attacker that has access to a huge page can evade the isolation 
provided by page coloring, since a huge page can span 
multiple page colors. Hence, we prevent access to huge pages 
(transparently or by request) for non-privileged users. 

As part of our implementation of page coloring, we also 
disable memory deduplication features, such as kernel same- 
page merging. This prevents a secure-colored page mapped 
into one process from being transparently mapped as shared 
into another process. Disabling memory deduplication is not 
unique to our solution and has been used in the past in 
hypervisors to prevent leakage of information across different
virtual machines.

When a process calls a protected function for the first time, 
we invoke a kernel module routine to remap all pages allocated 
by the process in private mappings (i.e., the heap, stack, text- 
segment, library code, and library data pages) to pages that 
are not shared with any other user’s processes. We also ensure 
these pages have a page color reserved by the user executing 
the protected function. The remapping transparently changes 
the physical pages that a process accesses without modifying 
the virtual memory addresses, and hence requires no special 
application support. If the user has not yet reserved any page 
colors or there are no more remaining pages of any of her 
reserved page colors, the OS allocates one of the reserved 
colors for the user. Also, the process is flagged with a “secure-color”
bit. We modify the OS so that it recognizes this flag and
ensures that future pages allocated to a private mapping for
the process will come from a reserved page color for the user.
Note that since we only remap private mappings, we do not 
protect applications that access a shared mapping from inside 
a protected function. 

This strategy for allocating page colors to users has a minor
potential downside: it restricts the number of
different users’ processes that can concurrently call protected
functions. We believe that such a restriction is a reasonable
trade-off between security and performance.

Lazy state cleansing. To ensure that an attacker does not
see the tainted state in a per-core resource after a protected
function finishes execution, we lazily cleanse all per-core resources.
When a protected function returns, we mark the CPU
as “tainted” with the user ID of the caller process. The next
time the OS attempts to schedule a process from a different
user on the core, it will first flush all per-CPU caches, including
the L1 instruction cache, L1 data cache, L2 cache, Branch
Target Buffer (BTB), and Translation Lookaside Buffer
(TLB). Such a scheme ensures that the overhead of flushing
these caches can be amortized over multiple invocations of
protected functions by the same user.
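
The taint-and-flush logic can be sketched as a toy model. The class and method names below are illustrative assumptions, and the flush itself is represented only by a counter.

```python
class Core:
    """Toy model of a core that tracks whose protected state it may hold."""

    def __init__(self):
        self.tainted_by = None  # user ID whose protected function last ran here
        self.flush_count = 0    # number of full per-core flushes performed

    def protected_function_returned(self, uid):
        """Mark the core as tainted instead of flushing immediately."""
        self.tainted_by = uid

    def schedule(self, uid):
        """Called before a process owned by `uid` is run on this core."""
        if self.tainted_by is not None and self.tainted_by != uid:
            self.flush_per_core_state()
            self.tainted_by = None

    def flush_per_core_state(self):
        # Stand-in for flushing the L1/L2 caches, the BTB, and the TLB.
        self.flush_count += 1
```

In this model, repeated protected calls by the same user trigger no flush at all; a single flush happens only when a process of a different user is about to run on the core.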

V. Implementation 

We built a prototype implementation of our protection 
mechanism for a system running Linux OS. We describe the 
different components of our implementation below. 


A. Programming API 


We implement a new function annotation FIXED_TIME
for the C/C++ language that indicates that a function should
be protected. The annotation can be specified either in the
declaration of the function or at its definition. Adding this
annotation is the only change to a C/C++ code base that a
programmer has to make in order to use our solution. We
wrote a plugin for the Clang C/C++ compiler that handles
this annotation. The plugin automatically inserts a call to
the function fixed_time_begin at the start of the protected
function and a call to fixed_time_end at any return point of
the function. These functions protect the annotated function
using the mechanisms described in Section IV.


Alternatively, a programmer can also call these functions
explicitly. This is useful for protecting ranges of code within a
function, such as the state transitions of the TLS state machine
(see Section VI-B). We provide a Java Native Interface wrapper
for both the fixed_time_begin and fixed_time_end functions to
support protected functions written in Java.


B. Time padding 

For implementing time padding loops, we read from the
timestamp counter in x86 processors to collect time measurements.
In most modern x86 processors, including the one we
tested on, the timestamp counter has a constant frequency
regardless of the power saving state of a processor. We generate
pseudorandom bytes for the randomized padding step using
the ChaCha/8 stream cipher. We use a value of 300
for T_penalty as this bounds the worst-case slowdown due to a
single interrupt we observed in our experiments.







Our implementation of the randomized wait operation takes 
an input X and simply performs X + c noops in a loop, where 
c is a large enough value so that the loop takes one cycle 
longer for each additional iteration. We observe that c = 46 is 
sufficient to achieve this property. 

Some of the OS modifications specified in our solution 
are implemented as a loadable kernel module. This module 
supports an ioctl call to mark a core as tainted at the
end of a protected function’s execution. The module also
supports an ioctl call that enables fast access to the interrupt
and context-switch count. In the standard Linux kernel, the 
interrupt count is usually accessed through the proc file system 
interface. However, such an interface is too slow for our 
purposes. Instead, our kernel module allocates a page of 
counters that is mapped into the virtual address space of the 
calling process. The task_struct of the calling process also
contains a pointer to these counters. We modify the kernel 
to check on every interrupt and context switch if the current 
task has such a page, and if so, to increment the corresponding 
counter in that page. 

Offline profiling. We provide a profiling wrapper script, 
fixed_time_record.sh, that computes worst-case execution
time parameters of each protected function as well as the 
worst-case slowdown on that function due to preemptions by 
different interrupts or kernel tasks. 




To ensure that a processor core executing a protected
function is not preempted by other user processes,
as specified in Section IV, we depend on a scheduling
mode that prevents other userspace processes from preempting
a protected function. For this purpose, we use the Linux
SCHED_FIFO scheduling mode at maximum priority. In order
to be able to do this, we allow unprivileged users to use
SCHED_FIFO at priority 99 by changing the limits in the
/etc/security/limits.conf file.


One side effect of this technique is that if a protected
function manually yields to the scheduler or performs blocking
operations, the process invoking the protected function may
be scheduled off. Therefore, we do not allow any blocking
operations or system calls inside the protected function. As
mentioned earlier, we also disable paging for the processes
executing protected functions by using the mlockall()
system call with the MCL_FUTURE flag.

We detect whether a protected function has violated the
conditions of isolated execution by determining whether any
voluntary context switches occurred during the protected function’s
execution. A voluntary context switch usually indicates that the protected
function either yielded the CPU manually or performed some blocking
operation.
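
The idea of this check can be approximated from user space. The sketch below is not the kernel-assisted counter page described earlier; it assumes a Unix system and uses the process-wide voluntary context switch count reported by getrusage, so switches made by other threads of the same process are also counted. The function names are illustrative.

```python
import resource


def voluntary_switches():
    """Voluntary context switches made by the calling process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw


def run_and_count_voluntary_switches(func, *args):
    """Run func(*args) and report how many voluntary switches occurred meanwhile."""
    before = voluntary_switches()
    result = func(*args)
    return result, voluntary_switches() - before
```

A non-zero count suggests that the wrapped function yielded the CPU or blocked, which is the condition the prototype treats as a violation of isolated execution.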


The profiling script automatically generates profiling information
for all protected functions in an executable by
running the application on different inputs. During the profiling
process, we run a variety of applications in parallel
to create a stress-testing environment that triggers worst-case
performance of the protected function. To allow the stress
testers to maximally slow down the user application, we reset
the scheduling parameters and CPU affinity of a thread at
the start and end of every protected function. One stress
tester generates interrupts at a high frequency using a simple
program that generates a flood of UDP packets to the loopback
network interface. We also run mprime, systester, and
the LINPACK benchmark to cause high CPU load and large
amounts of memory contention.

C. Prevent leakage through shared resources 

Isolating a processor core and core-specific caches. We 
disable hyperthreading in Linux by selectively disabling virtual 
cores. This prevents any other processes from interfering with 
the execution of a protected function. As part of our prototype, 
we also implement a simple version of the page coloring
scheme described in Section IV-B.

We prevent a user from observing hardware performance
counters showing the performance behavior of other users’
processes. The perf_events framework on Linux mediates
access to hardware performance counters. We configure the
framework to allow only privileged users to access per-CPU
performance counters. Note that an unprivileged user can
still access per-process performance counters that measure the
performance of their own processes.


mprime: http://www.mersenne.org/

systester: http://systester.sourceforge.net

LINPACK benchmark: https://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download/


Flushing shared resources. We modify the Linux scheduler 
to check the taint of a core before scheduling a user process 
on a processor core and to flush per-core resources if needed
as described in Section IV-B.

To flush the L1 and L2 caches, we iteratively read over
a segment of memory that is larger than the corresponding
cache sizes. We found this to be significantly more efficient
than using the WBINVD instruction, which we observed to cost
as much as 300 microseconds in our tests. We flush the
L1 instruction cache by executing a large number of NOP
instructions.

Current implementations of Linux flush the TLB during 
each context switch. Therefore, we do not need to separately
flush it. However, if Linux starts leveraging the PCID
feature of x86 processors in the future, the TLB would have 
to be flushed explicitly. For flushing the BTB, we leveraged 
a “branch slide” consisting of alternating conditional branch 
and NOP instructions. 

VI. Evaluation 

To show that our approach can be applied to protect a 
wide variety of software, we have evaluated our solution in 
three different settings and found that our solution successfully 
prevents local and remote timing attacks in all of these settings. 
We describe the settings in detail below. 

Encryption algorithms implemented in high level interpreted 
languages like Java. Traditionally, cryptographic algorithms 
implemented in interpreted languages like Java have been 
harder to protect from timing attacks than those implemented 
in low level languages like C. Most interpreted languages 
are compiled down to machine code on-the-fly by a VM 
using Just-in-Time (JIT) code compilation techniques. The
JIT compiler often optimizes the code non-deterministically
to improve performance. This makes it extremely hard for 
a programmer to reason about the transformations that are 
required to make a sensitive function’s timing behavior secret- 
independent. While developers writing low level code can 
use features such as in-line assembly to carefully control the 
machine code of their implementation, such low level control 
is simply not possible in a higher level language. 

We show that our techniques can take care of these issues. 
We demonstrate that our defense can make the computation 
time of Java implementations of cryptographic algorithms 
independent of the secret key with minimal performance 
overhead. 

Cryptographic operations and SSL/TLS state machine. Implementations
of cryptographic primitives other than the public/private
key encryption or decryption routines may also
suffer from side channel attacks. For example, a cryptographic
hash algorithm like SHA-1 takes a different amount of time
depending on the length of the input data. In fact, such timing
variations have been used as part of several existing attacks
against SSL/TLS protocols (e.g., Lucky 13). Also, the time
taken to perform the computation for implementing different
stages of the SSL/TLS state machine may also be dependent
on the secret key.

We find that our protection mechanism can protect cryptographic
primitives like hash functions as well as individual
stages of the SSL/TLS state machine from timing attacks while
incurring minimal overhead.

Sensitive data structures. Besides cryptographic algorithms,
timing channels also occur in the context of different data
structure operations like hash table lookups. For example, hash
table lookups may take a different amount of time depending on
how many items are present in the bucket where the desired
item is located. It takes longer to find items in buckets
with more items than in buckets with fewer items.
This signal can be exploited by an attacker to cause denial of
service attacks. We demonstrate that our technique can
prevent timing leaks using the associative arrays in the C++ STL,
a popular hash table implementation.

Experiment setup. We perform all our experiments on a 
machine with 2.3GHz Intel Xeon E5-2630 CPUs organized 
in 2 sockets each containing 6 physical cores unless otherwise 
specified. Each core has a 32KB L1 instruction cache, a 32KB
L1 data cache, and a 256KB L2 cache. Each socket has a
15MB L3 cache. The machine has a total of 64GB of RAM. 

For our experiments, we use OpenSSL version 1.0.1l and
the Java version of BouncyCastle 1.52 (beta). The test machine runs
Linux kernel version 3.13.11.4 with our modifications as
discussed in Section V.

A. Security evaluation

Preventing a simple timing attack. To determine the effectiveness
of our safe padding technique, we first test whether our
technique can protect against a large timing channel that can
distinguish between two different inputs of a simple function.
To make the attacker’s job easier, we craft a simple function
that has an easily observable timing channel: the function
executes a loop for 1 iteration if the input is 0 and 11 iterations
otherwise. We use the x86 loop instruction to implement
the loop and just a single nop instruction as the body of the
loop. We assume that the attacker calls the protected function
directly and measures the value of the timestamp counter
immediately before and after the call. The goal of the attacker
is to distinguish between two different inputs (0 and 1) by
monitoring the execution time of the function. Note that these
conditions are extremely favorable for an attacker.

Fig. 4: Defeated distinguishing attack

We found that our defense completely defeats such a
distinguishing attack despite the highly favorable conditions
for the attacker. We also found that the timing randomization
step (described in Section IV-A) is critical for such protection
and that a naive padding loop without any timing randomization step
indeed leaks information. Figure 4(A) shows the distributions
of observed runtimes of the protected function on inputs 0
and 1 with no defense applied. Figure 4(B) shows the runtime
distributions where padding is added to reach T_max = 5000
cycles (≈ 2.17 µs) without the time randomization step. In
both cases, it can be seen that the observed timing distributions
for the two different inputs are clearly distinguishable.
Figure 4(C) shows the same distributions when m = 5 rounds
of timing randomization are applied along with time padding.
In this case, we are no longer able to distinguish the timing
distributions.

Fig. 5: The effect of multiple rounds of randomized noise addition
on the timing channel

We quantify the possibility of success for a distinguishing
attack in Figure 5 by plotting the variation of the empirical
statistical distance between the observed distributions as the
amount of padding noise added is changed. The statistical
distance is computed using the following formula.

$$ d(X, Y) = \frac{1}{2} \sum_{i \in \Omega} \left| P[X = i] - P[Y = i] \right| $$

We measure the statistical distance over the set of observations
that are within the range of 50 cycles on either side of the median
(this range contains nearly all observations). Each distribution
consists of around 600 million observations.
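
The empirical statistical distance between two sets of timing samples can be computed as in the following sketch. The function name is illustrative, and the sketch takes the sum over every value that appears in either sample set rather than over a fixed window around the median.

```python
from collections import Counter


def statistical_distance(samples_x, samples_y):
    """Empirical d(X, Y) = 1/2 * sum over i of |P[X = i] - P[Y = i]|."""
    if not samples_x or not samples_y:
        raise ValueError("both sample sets must be non-empty")
    count_x, count_y = Counter(samples_x), Counter(samples_y)
    n_x, n_y = len(samples_x), len(samples_y)
    support = set(count_x) | set(count_y)
    return 0.5 * sum(abs(count_x[i] / n_x - count_y[i] / n_y) for i in support)
```

Identical sample sets give a distance of 0 and sample sets with no values in common give a distance of 1, so values near 0 indicate that the two timing distributions are hard to tell apart.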

The dashed line in Figure 5 shows the statistical distance
between two different instances of the test function with 0 as 
input. The solid line shows the statistical distance where one 
instance has 0 as input and the other has 1. We observe that 
the attack can be completely prevented if at least 2 rounds of 
noise are used. 

Preventing timing attack on RSA decryption. We next evaluate
the effectiveness of our time padding approach in defeating
the timing attack by Brumley et al. against unblinded
RSA implementations. Blinding is an algorithmic modification
to RSA that uses randomness to prevent timing attacks. To
isolate the impact of our specific defense, we apply our
defense to the RSA implementation in OpenSSL 1.0.1h with
such constant time defenses disabled. In order to do so, we
configure OpenSSL to disable blinding, use the non-constant
time exponentiation implementation, and use the non-word-based
Montgomery reduction implementation. We measure the
time of decrypting 256-byte messages with a random 2048-bit
key. We chose messages to have Montgomery representations
differing by multiples of 2^1016. Figure 6(A) shows the average
observed running time for such a decryption operation, which
is around 4.16 ms. The messages are displayed from left to
right in sorted order of how many Montgomery reductions
occur during the decryption. Each message was sampled
roughly 8,000 times and the samples were randomly split
into 4 sample sets. As observed by Brumley et al., the
number of Montgomery reductions can be roughly determined
from the running time of an unprotected RSA decryption. Such
information can be used to derive full-length keys.

Fig. 6: Protecting against timing attacks on unblinded RSA

We then apply our defense to this decryption with T_max
set to 9.68 × 10^6 cycles ≈ 4.21 ms. One timer interrupt
is guaranteed to occur during such an operation, as timer
interrupts occur at a rate of 250/s on our target machine. We
collect 30 million measurements and observe a multi-modal
padded distribution with four narrow, disjoint peaks corresponding
to the padding algorithm using different T_ext_preempt
values for 1, 2, 3, and 4 interrupts respectively. The four
peaks represent, respectively, 94.0%, 5.8%, 0.6%, and 0.4% of
the samples. We did not observe that these probabilities vary
across different messages. Hence, in Figure 6(B), we show
the average observed time considering only observations from
within the first peak. Again, samples are split into 4 random
sample sets; each key is sampled around 700,000 times. We
observe no message-dependent signal.
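
The cycle counts and the timer-interrupt argument above can be checked with a short calculation. The sketch assumes that the timestamp counter ticks at the nominal 2.3 GHz clock rate of the test machine; the constant and function names are illustrative.

```python
CPU_HZ = 2.3e9   # assumed timestamp counter rate (nominal clock of the test CPU)
TIMER_HZ = 250   # timer interrupt rate reported for the test machine


def cycles_to_ms(cycles, cpu_hz=CPU_HZ):
    """Convert a cycle count to milliseconds at the given clock rate."""
    return cycles / cpu_hz * 1e3


def min_timer_interrupts(duration_ms, timer_hz=TIMER_HZ):
    """Minimum number of periodic timer ticks that fall within the interval."""
    period_ms = 1e3 / timer_hz
    return int(duration_ms // period_ms)
```

With these assumptions, 9.68 × 10^6 cycles corresponds to roughly 4.21 ms, which is longer than the 4 ms timer period, so at least one timer interrupt falls within every padded decryption.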

Preventing cache attacks on AES encryption. We next
verify that our system protects against local cache attacks.
Specifically, we measured the effectiveness of our defense
against the PRIME+PROBE attack by Osvik et al. [35] on the
software implementation of AES encryption in OpenSSL. For
our tests, we apply the attack on only the first round of AES
instead of the full AES to make the conditions very favorable 
to the attacker as subsequent rounds of AES add more noise to 
the cache readings. In this attack, the attacker first primes the 
cache by filling a selection of cache sets with the attacker’s 
memory lines. Next, the attacker coerces the victim process 
to perform an AES encryption on a chosen plaintext on the 
same processor core. Finally, the attacker reloads the memory 
lines it used to fill the cache sets prior to the encryption. This 
allows the attacker to detect whether the reloaded lines were 
still cached by monitoring timing or performance counters and 
thus infer which memory lines were accessed during the AES 
encryption operation. 

On our test machine, the OpenSSL software AES implementation
performs table lookups during the first round of
encryption that access one of 16 cache sets in each of 4 lookup
tables. The actual cache sets accessed during the operation are
determined by XORs of the top 4 bits of certain plaintext
bytes p_i and certain key bytes k_i. By repeatedly observing
cache accesses on chosen plaintexts where p_i takes all possible
values of its top 4 bits, but where the rest of the plaintext is
randomized, the attacker observes cache line access patterns
revealing the top 4 bits of p_i ⊕ k_i, and hence the top 4 bits
of the key byte k_i. This simple attack can be extended to learn the
entire AES key.
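
The inference step of this first-round attack can be sketched in a noise-free model. The sketch assumes that the top 4 bits of p_i ⊕ k_i directly select one of the 16 cache sets of a lookup table; `observe_set` is a hypothetical oracle standing in for the attacker's PRIME+PROBE measurement.

```python
def first_round_cache_set(plaintext_byte, key_byte):
    """Index (0-15) of the cache set touched by the lookup for p_i XOR k_i."""
    return ((plaintext_byte ^ key_byte) & 0xFF) >> 4


def recover_key_top_nibble(observe_set):
    """Recover the top 4 bits of k_i from observed cache-set indices.

    observe_set(p) is a hypothetical oracle that returns the set index
    observed when plaintext byte i equals p.
    """
    guesses = {observe_set(p_top << 4) ^ p_top for p_top in range(16)}
    if len(guesses) != 1:
        raise ValueError("observations are inconsistent or too noisy")
    return guesses.pop()
```

For a key byte of 80 (decimal), every chosen plaintext yields the same guess, namely 5, for the top 4 bits of the key byte.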

We use a hardware performance monitoring counter that
counts L2 cache misses as the probe measurement, and for
each measurement we subtract the average measurement for
that cache set over all values of p_i. Figure 7(A) and Figure 7(B)
show the probe measurements when performing this attack for
all values of the top 4 bits of p_0 (left) and p_5 (right) without
and with our protection scheme, respectively. Darker cells
indicate elevated measurements, and hence imply cache sets
that contain a line loaded by the attacker during the “prime”
phase that was evicted by the AES encryption. The secret key
k is randomly chosen, except that k_0 = 0 and k_5 = 80 (decimal).
Without our solution, the cache set accesses show a pattern
revealing p_i ⊕ k_i, which can be used to determine that the
top 4 bits of k_0 and k_5 are indeed 0 and 5, respectively. Our
solution flushes the L2 cache lazily before handing it over
to any untrusted process and thus ensures that no signal is
observed by the attacker, as shown in Figure 7(B).

B. Performance evaluation 

Performance costs of individual components. Table I shows
the individual cost of the different components of our defense.
Our total performance overhead is less than the sum
of these components, as we do not perform most of these
operations in the critical path. Note that the cost of retrieving the
number of times a process was interrupted or determining whether a
voluntary context switch occurred during a protected function’s
execution is negligible due to our modifications to the Linux
kernel described in Section V.

Component                                   Cost (ns)
m = 5 time randomization step, WCET               710
Get interrupt counters                             16
Detect context switch                               4
Set and restore SCHED_FIFO                      2,650
Set and restore CPU affinity                    1,235
Flush L1D+L2 cache                             23,000
Flush BTB cache                                 7,000

TABLE I: Performance overheads of individual components of our
defense. WCET indicates worst-case execution time. Only costs listed
in the upper half of the table are incurred on each call to a protected
function.

Microbenchmarks: cryptographic operations in multiple languages.
We perform a set of microbenchmarks that test
the impact of our solution on individual operations such as
RSA and ECDSA signing in the OpenSSL C library and in
the BouncyCastle Java library. In order to apply our defense
to BouncyCastle, we constructed JNI wrapper functions that
call the fixed_time_begin and fixed_time_end functions. Since
both libraries implement RSA blinding to defend against
timing attacks, we disable RSA blinding when applying our
defense.

The results of the microbenchmarks are shown in Table II.
Note that the delays experienced in any real applications will
be significantly less than in these microbenchmarks, as real
applications will also perform some I/O operations that will
amortize the performance overhead.

For OpenSSL, our solution adds between 3% (for RSA)
and 71% (for ECDSA) to the cost of computing a signature on
average. However, we offer significantly reduced tail latency
for RSA signatures. This behavior is caused by the fact that
OpenSSL regenerates the blinding factors every 32 calls to
the signing function to amortize the performance cost of
generating the blinding factors.

Focusing on the BouncyCastle results, our solution results
in a 2% decrease in cost for RSA signing and a 63% increase
in cost for ECDSA signing, compared to the stock
BouncyCastle implementation. We believe that this increase
in cost for ECDSA is justified by the increase in security,
as the stock BouncyCastle implementation does not defend
against local timing attacks. Furthermore, we believe that some
optimizations, such as configuring the Java VM to schedule
garbage collection outside of protected function executions,
could reduce this overhead.
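
The percentages quoted above follow from the mean signing times reported in Table II. The short sketch below recomputes them; the dictionary keys and the function name are illustrative.

```python
def relative_overhead_percent(stock_ms, protected_ms):
    """Change in cost relative to the stock implementation, in percent."""
    return (protected_ms - stock_ms) / stock_ms * 100.0


# (stock mean, mean with the defense applied) in milliseconds, from Table II.
MEAN_SIGN_MS = {
    "OpenSSL RSA": (1.50, 1.55),
    "OpenSSL ECDSA": (0.07, 0.12),
    "BouncyCastle RSA": (9.80, 9.63),
    "BouncyCastle ECDSA": (0.22, 0.36),
}

OVERHEADS = {name: relative_overhead_percent(stock, protected)
             for name, (stock, protected) in MEAN_SIGN_MS.items()}
```

A negative value, as for BouncyCastle RSA, means that the protected version without blinding is cheaper on average than the stock implementation.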

Macrobenchmark: protecting the TLS state machine. We
applied our solution to protect the server-side implementation
of the TLS connection protocol in OpenSSL. The TLS protocol
is implemented as a state machine in OpenSSL, and this presented
a challenge for applying our solution, which is defined in
terms of protected functions. Additionally, reading and writing
to a socket is interleaved with cryptographic operations in the
specification of the TLS protocol, which conflicts with our
solution’s requirement that no blocking I/O may be performed
within a protected function.












                                   Mean (ms)    99% Tail
RSA 2048-bit sign
  OpenSSL w/o blinding                1.45        1.45
  Stock OpenSSL                       1.50        2.18
  OpenSSL + our solution              1.55        1.59
  BouncyCastle w/o blinding           9.02        9.41
  Stock BouncyCastle                  9.80       10.20
  BouncyCastle + our solution         9.63        9.82
ECDSA 256-bit sign
  Stock OpenSSL                       0.07        0.08
  OpenSSL + our solution              0.12        0.38
  Stock BouncyCastle                  0.22        0.25
  BouncyCastle + our solution         0.36        0.48

TABLE II: Impact on performance of signing a 100 byte message
using SHA-256 with RSA or ECDSA for the OpenSSL and BouncyCastle
implementations. Measurements are in milliseconds. We
disable blinding when applying our defense to the RSA signature
operation. Bold text indicates a measurement where our defense
results in better performance than the stock implementation.

                                              Mean (ms)    99% Tail
Connection latency (RSA)
  Stock OpenSSL                                  5.26        6.82
  Stock OpenSSL + our solution (sign only)       5.33        6.53
  Stock OpenSSL + our solution                   5.52        6.74
Connection latency (ECDSA)
  Stock OpenSSL                                  4.53        6.08
  Stock OpenSSL + our solution (sign only)       4.64        6.18
  Stock OpenSSL + our solution                   4.75        6.36

TABLE III: The impact on TLS v1.2 connection latency when applying
our defense to the OpenSSL server-side TLS implementation.
We evaluate the cases where the server uses an RSA 2048-


We addressed both challenges by generalizing the notion of 
a protected function to that of a protected interval, which is an 
interval of execution starting with a call to hxed time begin 
and ending with hxed time end. We then split an execution 
of the TLS protocol into protected intervals on boundaries 
dehned by transitions of the TLS state machine and on low- 
level socket read and write operations. To achieve this, we 
hrst inserted calls to hxed time begin and hxed time end at 
the start and end of each state within the TLS state machine 
implementation. Next, we modihed the low-level socket read 
and socket write OpenSSL wrapper functions to end the current 
interval, communicate with the socket, and then start a new 
interval. Thus divided, all cryptographic operations performed 
inside the TLS implementation are within a protected interval. 
Each interval is uniquely identihable by the name of the 
current TLS state concatenated with an integer incremented 
every time a new interval is started within the same TLS state 
(equivalently, the number of socket operations that occurred 
so far during the state.) 

The advantage of this strategy is that, unlike any prior 
defenses, it protects the entire implementation of the TLS 
state machine from any form of timing attack. However, such 
protection schemes may incur additional overheads due to 
protecting parts of the protocol that may not be vulnerable 
to timing attacks because they do not work with secret data. 


bit or ECDSA 256-bit signing key with SHA-256 as the digest 
function. Latency given in milliseconds and measures the end-to-end 
connection time. The client uses the unmodified OpenSSL library 
attempts. We evaluate our defense when only protecting the signing 
operation and when protecting all server-side routines performed as 
part of the TLS connection protocol that use cryptography. Even when 
the full TLS protocol is protected, our approach adds an overhead of 
less than 5% to average connection latency. Bold text indicates a 
measurement where our defense results in better performance than 
the stock implementation. 
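
The protected-interval scheme used for the TLS state machine can be 
sketched as follows. This is a minimal model, not the OpenSSL 
modification itself: the class IntervalTracker, its method names, and 
the two state names are hypothetical, and the points where 
fixed_time_begin and fixed_time_end would be called are only marked 
in comments. 

```python
class IntervalTracker:
    """Names protected intervals '<TLS state>:<n>', where n counts the
    socket operations seen so far in that state."""

    def __init__(self):
        self.state = None
        self.count = 0
        self.open = None   # name of the currently open interval, if any
        self.log = []      # every interval name, in the order opened

    def _begin(self):
        self.open = f"{self.state}:{self.count}"
        self.log.append(self.open)   # fixed_time_begin would be called here
        self.count += 1

    def _end(self):
        self.open = None             # fixed_time_end would be called here

    def enter_state(self, state):
        if self.open is not None:
            self._end()
        self.state, self.count = state, 0
        self._begin()

    def leave_state(self):
        self._end()

    def socket_io(self, do_io):
        """End the interval, perform (possibly blocking) I/O, start a new one."""
        self._end()
        result = do_io()
        self._begin()
        return result


tracker = IntervalTracker()
tracker.enter_state("STATE_A")                        # hypothetical state name
seen_during_io = tracker.socket_io(lambda: tracker.open)
tracker.socket_io(lambda: None)
tracker.leave_state()
tracker.enter_state("STATE_B")                        # hypothetical state name
tracker.leave_state()
```

In this model no interval is open while the socket callback runs, which 
mirrors the requirement that blocking I/O never occurs inside a 
protected interval. 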


We evaluate the performance of the fully protected TLS 
state machine as well as an implementation that only protects 
the public key signing operation. The results are shown in 
Table III. We observe an overhead of less than 5% on connection 
latency even when protecting the full TLS protocol. 

Protecting sensitive data structures. We measured the 
overhead of applying our approach to protect the lookup operation 
of the C++ STL unordered_map. For this experiment, we 
populate the hash map with 1 million 64-bit integer keys and 
values. We assume that the attacker cannot insert elements 
in the hash map or cause collisions. The average cost of 
performing a lookup of a key present in the map is 0.173 µs 
without any defense and 2.46 µs with our defense applied. 
Most of this overhead is caused by the fact that the worst-case 
execution time of the lookup operation is significantly larger 
than the average case: the profiled worst-case execution time of 
the lookup when interrupts do not occur is 1.32 µs at k = 10“®. 
Thus, any timing channel defense will cause the lookup to 
take at least 1.32 µs. The worst-case execution estimate of the 
lookup operation increases to 13.3 µs when interrupt cases are 
not excluded; hence our scheme benefits significantly from 
adapting to interrupts during padding for this example. Another 
major part of the overhead of our solution (0.710 µs) comes 
from the randomization step to ensure safe padding. As we 
described earlier in Section VI-A, the randomization step is 
crucial to ensure that there is no timing leakage. 

Hardware portability. Our solution is not specific to any 
particular hardware. It will work on any hardware that supports 
a standard cache hierarchy and where page coloring can be 
implemented. To test the portability of our solution, we executed 
some of the benchmarks mentioned in Sections VI-A and VI-B 
on a 2.93 GHz Intel Xeon X5670 CPU. We confirmed that 
our solution successfully protects against the local and remote 
timing attacks on that platform too. The relative performance 
overheads were similar to the ones reported above. 


VII. Limitations 

No system calls inside protected functions. Our current 
prototype does not support protected functions that invoke 
system calls. A system call can inadvertently leak information 
to an attacker by leaving state in shared kernel data structures, 
which an attacker might indirectly observe by invoking the 
same system call and timing its duration. Alternatively, a 
system call might access regions of the L3 cache that can 
be snooped by an attacker process. 

The lack of system call support has not been a significant 
issue in practice, as our experiments so far indicate that system 
calls are rarely used in functions dealing with sensitive data 
(e.g., cryptographic operations). However, if needed in the future, 
one way of supporting system calls inside protected functions 
while still avoiding this leakage is to apply our solution to the 
kernel itself. For example, we can pad any system calls that 
modify some shared kernel data structures to their worst-case 
execution times. 

Indirect timing variations in unprotected code. Our 
approach does not currently defend against timing variations in 
the execution of non-sensitive code segments that might get 
indirectly affected by a protected function’s execution. For 
example, consider the case where a non-sensitive function 
from a process gets scheduled on a processor core immediately 
after another process from the same user finishes executing a 
protected function. In such a case, our solution will not flush 
the state of per-core resources like the L1 cache, as both these 
processes belong to the same user. However, if such remnant 
cache state affects the timing of the non-sensitive function, an 
attacker may be able to observe these variations and infer some 
information about the protected function. 

Note that currently there are no known attacks that could 
exploit this kind of leakage. A conservative approach that 
prevents such leakages is to flush all per-cpu resources at the 
end of each protected function. This will, of course, result 
in higher performance overheads. The costs associated with 
cleansing different types of per-cpu resources are summarized 
in Table I. 

Leakage due to fault injection. If an attacker can cause 
a process to crash in the middle of a protected function’s 
execution, the attacker can potentially learn secret information. 
For example, consider a protected function that first performs 
a sensitive operation and then parses some input from the user. 
An attacker can learn the duration of the sensitive operation 
by providing a bad input to the parser that makes it crash and 
measuring how long it takes the victim process to crash. 

Our solution, in its current form, does not protect against 
such attacks. However, this is not a fundamental limitation. 
One simple way of overcoming these attacks is to modify 
the OS to apply the time padding for a protected function 
even after it has crashed as part of the OS’s cleanup handler. 
This can be implemented by modifying the OS to keep track 
of all processes that are executing protected functions at any 
given point of time and their respective padding parameters. 
If any protected function crashes, the OS cleanup handler for 
the corresponding process can apply the desired amount of 
padding. 
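
The idea of applying the pad even when the protected function fails 
can be illustrated with a user-space analogue. This is only a sketch 
under assumptions: the mitigation described above lives in the OS 
cleanup handler, whereas the hypothetical helper run_padded below 
merely shows the control flow, with the clock and sleep functions 
passed in as parameters so the behavior can be checked without real 
delays. 

```python
import time


def run_padded(protected_fn, pad_seconds, clock=time.monotonic, sleep=time.sleep):
    """Run protected_fn, then wait until pad_seconds have elapsed since the
    start, whether protected_fn returned normally or raised."""
    start = clock()
    try:
        return protected_fn()
    finally:
        # The pad is applied on the failure path too, so the time at which
        # a crash becomes visible does not depend on the work done before it.
        remaining = pad_seconds - (clock() - start)
        if remaining > 0:
            sleep(remaining)


class FakeClock:
    """Deterministic stand-in for a clock; sleeping just advances time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds
```

A process crash cannot be intercepted this way from inside the 
process, which is why the text places the padding in the OS rather 
than in the application. 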

VIII. Related work 
A. Defenses against remote timing attacks 

Remote timing attacks exploit the input-dependent 
execution times of cryptographic operations. There are three main 
approaches to make cryptographic operations’ execution times 
independent of their inputs: static transformation, 
application-specific changes, and dynamic padding. 

Application-specific changes. One conceptually simple way 
to defend an application against timing attacks is to modify its 
sensitive operations such that their timing behavior is not 
key-dependent. For example, AES QoIEtI [30] implementations 
can be modified to ensure that their execution times are 
key-independent. Note that, since the cache behavior impacts 
running time, achieving secret-independent timing usually 
requires rewriting the operation so that its memory access pattern 
is also independent of secrets. Such modifications are 
application-specific, hard to design, and very brittle. By contrast, our 
solution is completely independent of the application and the 
programming language. 

Static transformation. An alternative approach to prevent 
remote attacks is to use static transformations on the 
implementation of the cryptographic operation to make it constant 
time. One can use a static analyzer to find the longest possible 
path through the cryptographic operation and insert padding 
instructions that have no side-effects (like NOP) along other 
paths so that they take the same amount of time as the longest 
path ini [20]. While this approach is generic and can be 
applied to any sensitive operation, it has several drawbacks. In 
modern architectures like x86, the execution time of several 
instructions (e.g., the integer divide instruction and multiple 
floating-point instructions) depends on the values of their 
inputs. This makes it extremely hard and time 
consuming to statically estimate the execution time of these 
instructions. Moreover, it is very hard to statically predict the 
changes in the execution time due to internal cache collisions 
in the implementation of the cryptographic operation. To avoid 
such cases, in our solution, we use dynamic offline profiling 
to estimate the worst-case runtime of a protected function. 
However, such dynamic techniques suffer from incompleteness, 
i.e., they might miss worst-case execution times triggered by 
pathological inputs. 
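
The longest-path padding idea behind static transformation can be 
sketched on a toy control-flow graph. The graph, the per-block cycle 
counts, and the helper path_costs are all hypothetical; in particular, 
the fixed per-block costs are exactly the assumption that the 
paragraph above argues does not hold on architectures such as x86. 

```python
def path_costs(cfg, costs, node, sink):
    """Yield (path, total_cost) for every path from node to sink in an
    acyclic control-flow graph."""
    if node == sink:
        yield [node], costs[node]
        return
    for succ in cfg[node]:
        for path, cost in path_costs(cfg, costs, succ, sink):
            yield [node] + path, costs[node] + cost


# Toy graph: a single secret-dependent branch with unequal arms.
cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
costs = {"entry": 2, "then": 10, "else": 3, "exit": 1}  # assumed cycle counts

paths = list(path_costs(cfg, costs, "entry", "exit"))
longest = max(cost for _, cost in paths)
# Padding (e.g., NOP cycles) each path needs in order to match the longest one.
padding = {tuple(path): longest - cost for path, cost in paths}
```

Here the shorter arm would need 7 cycles of padding to match the 
longer one. 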

Dynamic padding. Dynamic padding techniques add a 
variable amount of padding to a sensitive computation that depends 
on the observed execution time of the computation in order 
to mitigate the timing side-channel. Several prior works E 
miEiiiniiiTi have presented ways to pad the execution of a 
black-box computation to certain predetermined thresholds and 
obtain bounded information leakage. Zhang et al. designed a 
new programming language that, when used to write sensitive 
operations, can enforce limits on the timing information 
leakage Its]. The major drawback of existing dynamic padding 
schemes is that they incur large performance overhead. This 
results from the fact that their estimates of the worst-case 
execution time tend to be overly pessimistic, as they depend on 
several external parameters like OS scheduling, cache behavior 
of the simultaneously running programs, etc. For example, 
Zhang et al. BtI set the worst-case execution time to be 300 
seconds for protecting a Wiki server. Such overly pessimistic 
estimates increase the amount of required padding and thus 
result in significant performance overheads (90-400% in 
macro-benchmarks El). Unlike existing dynamic padding 
schemes, our solution incurs minimal performance overhead 
and protects against both local and remote timing attacks. 
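
Padding to predetermined thresholds can be sketched as follows. This 
is a simplified model rather than any of the cited schemes: the 
threshold values are hypothetical, the helper padded_duration is 
illustrative, and the leakage bound shown applies only to a single 
observed run, where an observer can distinguish at most as many 
outcomes as there are thresholds. 

```python
import bisect
import math


def padded_duration(observed, thresholds):
    """Return the smallest predetermined threshold that is >= observed.
    thresholds must be sorted in increasing order."""
    i = bisect.bisect_left(thresholds, observed)
    if i == len(thresholds):
        raise ValueError("observed time exceeds the largest threshold")
    return thresholds[i]


thresholds = [1, 2, 4, 8, 16]                 # milliseconds, assumed values
raw_times = [0.3, 0.9, 1.1, 3.7, 4.0, 9.5]    # assumed secret-dependent times
observable = {padded_duration(t, thresholds) for t in raw_times}

# One run reveals at most log2(number of thresholds) bits about the secret.
leak_bound_bits = math.log2(len(thresholds))
```

The pessimism discussed above enters through the largest threshold: 
it must cover the worst-case execution time, so an inflated 
worst-case estimate directly inflates the padding. 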

B. Defenses against local attacks 

Local attackers can also perform timing attacks, hence 
some of the defenses provided in the prior section may also 
be used to defend against some local attacks. However, local 
attackers also have access to shared hardware resources that 
contain information related to the target sensitive operation. 
The local attackers also have access to fine-grained timers. 

A common local attack vector is to probe a shared hardware 
resource, and then, using the fine-grained timer, measure how 
long the probe took to run. Most of the proposed defenses to 
such attacks try to either remove access to fine-grained timers 
or isolate access to the shared hardware resources. Some of 
these defenses also try to minimize information leakage by 
obfuscating the sensitive operation’s hardware access patterns. 
We describe these approaches in detail below. 

Removing fine-grained timers. Several prior projects have 
evaluated removing or modifying time measurements taken on 
the target machine ll^l?4l l42ll . Such solutions are often quite 
effective at preventing a large number of local side channel 
attacks as the underlying states of most shared hardware 
resources can only be read by accurately measuring the time 
taken to perform certain operations (e.g., read a cache line). 

However, removing access to wall clock time is not 
sufficient for protecting against all local attackers. For example, a 
local attacker executing multiple probe threads can infer time 
measurements by observing the scheduling behavior of the 
threads. Custom scheduling schemes (e.g., instruction-based 
scheduling) can eliminate such an attack ll?8]| but implementing 
these defenses requires major changes to the OS scheduler. In 
contrast, our solution only requires minor changes to the OS 
scheduler and protects against both local and remote attackers. 

Preventing sharing of hardware state across processes. Many 
proposed defenses against local attackers prevent an attacker 
from observing state changes to shared hardware resources 
caused by a victim process. We divide the proposed defenses 
into five categories and describe them next. 

Resource partitioning. Partitioning shared hardware resources 
can defeat local attackers, as they cannot access the same 
partition of the resource as a victim. Kim et al. [28] present 
an efficient management scheme for preventing local timing 
attacks across virtual machines (VMs). Their technique locks 
memory regions accessed by sensitive functions into reserved 
portions of the L3 cache. This scheme can be more efficient 
than page coloring. Such protection schemes are 
complementary to our technique. For example, our solution can be 
modified to use such a mechanism instead of page coloring to 
dynamically partition the L3 cache. 
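
For readers unfamiliar with page coloring, the following sketch shows 
how a page's color is commonly derived in a physically indexed, 
set-associative cache. The cache geometry is hypothetical, the helper 
names are illustrative, and the model ignores the address hashing 
that some processors apply when distributing lines across L3 slices. 

```python
def num_colors(cache_bytes, ways, page_bytes=4096):
    """Number of page colors: bytes covered by one cache way / page size."""
    return cache_bytes // (ways * page_bytes)


def page_color(phys_addr, cache_bytes, ways, page_bytes=4096):
    """Color of the physical page containing phys_addr. Pages of different
    colors map to disjoint groups of cache sets."""
    page_number = phys_addr // page_bytes
    return page_number % num_colors(cache_bytes, ways, page_bytes)


# Assumed geometry: 8 MiB, 16-way L3 with 4 KiB pages.
L3_BYTES, L3_WAYS = 8 * 2**20, 16
colors = num_colors(L3_BYTES, L3_WAYS)
color_of_sixth_page = page_color(5 * 4096 + 123, L3_BYTES, L3_WAYS)
```

Giving a protected function only pages of colors that no other 
process may use is what keeps its cache sets disjoint from those of 
an attacker. 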

Some of the other resource partitioning schemes (e.g., 
Ristenpart et al. llJTl) suggest allocating dedicated hardware 
to each virtual machine instance to prevent cross-VM attacks. 
However, such schemes are wasteful of hardware resources as 
they decrease the amount of resources available to concurrent 
processes. By contrast, our solution utilizes the shared 
hardware resources efficiently as they are only isolated during the 
execution of the protected functions. The time a process spends 
executing protected functions is usually much smaller than the 
time it spends in non-sensitive computations. 

Limiting concurrent access. If gang scheduling is used 
or hyperthreading is disabled, an attacker can only observe 
per-CPU resources when it has preempted a victim. Hence, 
reducing the frequency of preemptions reduces the feasibility 
of cache attacks on per-CPU caches. Varadarajan et al. HD 
propose using minimum runtime guarantees to ensure that a 
VM is not preempted too frequently. However, as noted in BTIl, 
such a scheme is very hard to implement in an OS scheduler 
as, unlike a hypervisor scheduler, an OS scheduler must deal 
with an unbounded number of processes. 

Custom hardware. Custom hardware can be used to obfuscate 
and randomize the victim process’s usage of the hardware. For 
example, Wang et al. iUIBl proposed new ways of designing 
caches that ensure that no information about cache usage is 
shared across different processes. However, such schemes have 
limited practical usage as they, by design, cannot be deployed 
on off-the-shelf commodity hardware. 

Flushing state. Another class of defenses ensures that the state 
of any per-CPU hardware resource is cleared before 
transferring it from one process to another. Düppel, by Zhang 
et al. [50], flushes per-CPU L1 and (optionally) L2 caches 
periodically in a multi-tenant VM setting. Their solution also 
requires hyperthreading to be disabled. They report around 
7% overheads on regular workloads. In essence, this scheme 
is similar to our solution’s technique of flushing per-CPU 
resources in the OS scheduler. However, unlike Düppel, we 
flush the state lazily, only when a context switch to a different 
user process than the one executing a protected operation 
occurs. Also, Düppel only protects against local cache attacks. 
We protect against both local and remote timing and cache 
attacks while still incurring less overhead than Düppel. 
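
The lazy flushing policy can be modeled in a few lines. This is a 
sketch of the policy only, not of the scheduler changes: the class 
Core and its methods are hypothetical, user IDs are plain integers, 
and the actual flush of per-CPU state is represented by a counter. 

```python
class Core:
    """Tracks whether per-CPU state on one core still reflects a protected
    function, and flushes only when another user's process is scheduled."""

    def __init__(self):
        self.tainted_by = None   # uid whose protected function last ran here
        self.flushes = 0         # stand-in for actually flushing per-CPU state

    def run_protected(self, uid):
        self.tainted_by = uid

    def context_switch(self, next_uid):
        # Lazy: nothing happens when the incoming process belongs to the
        # same user, or when no protected function has run since the last flush.
        if self.tainted_by is not None and next_uid != self.tainted_by:
            self.flushes += 1
            self.tainted_by = None


core = Core()
core.run_protected(uid=1000)
core.context_switch(next_uid=1000)   # same user: no flush
flushes_same_user = core.flushes
core.context_switch(next_uid=1001)   # different user: flush once
core.context_switch(next_uid=1002)   # already clean: no further flush
```

A periodic scheme would pay the flush cost on a fixed schedule 
instead, regardless of which process runs next. 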

Application transformations. Sensitive operations in 
different programs can also be modified 
to exhibit either secret-independent or obfuscated hardware 
access patterns. If the access to the hardware is independent 
of secrets, then an attacker cannot use any of the state leaked 
through shared hardware to learn anything meaningful about 
the sensitive operations. Several prior projects have shown 
how to modify AES implementations to obfuscate their cache 
access patterns [9], [10], [11], [35], [40]. Similarly, recent versions of 
OpenSSL use a specifically modified implementation of RSA 
that ensures secret-independent cache accesses. Some of these 
transformations can also be applied dynamically. For example, 
Crane et al. ED implement a system that dynamically applies 
cache-access obfuscating transformations to an application at 
runtime. 

However, these transformations are specific to particular 
cryptographic operations and are very hard to implement and 
maintain correctly. For example, 924 lines of assembly code 
had to be added to OpenSSL to make the RSA 
implementation’s cache accesses secret-independent. 
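
One simple secret-independent access pattern is to read every table 
entry and select the wanted one with a mask instead of an index. The 
sketch below is a generic illustration, not the transformation used 
by any of the cited works or by OpenSSL; the helper oblivious_lookup 
is hypothetical, and Python gives no timing guarantees, so the code 
conveys only the memory access pattern. It assumes non-negative 
integer entries and a table shorter than 2^63 elements. 

```python
def oblivious_lookup(table, secret_index):
    """Return table[secret_index] while touching every entry in the same
    order, with no branch or memory address that depends on the secret."""
    result = 0
    for i, value in enumerate(table):
        diff = i ^ secret_index
        # (diff - 1) >> 63 is -1 (all ones) exactly when diff == 0, else 0.
        mask = (diff - 1) >> 63
        result |= value & mask
    return result


table = [10, 20, 30, 40]
looked_up = [oblivious_lookup(table, i) for i in range(len(table))]
```

The cost is a full scan per lookup, which is one reason such 
transformations are hand-tuned per operation rather than applied 
generically. 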

IX. Conclusion 

We presented a low-overhead, cross-architecture defense 
that protects applications against both local and remote timing 
attacks with minimal application code changes. Our 
experiments and evaluation also show that our defense works 
across different applications written in different programming 
languages. 

Our solution defends against both local and remote attacks 
by using a combination of two main techniques: (i) a time 
padding scheme that only takes secret-dependent time 
variations into account, and (ii) preventing information leakage 
via shared resources such as the cache and branch prediction 
buffers. We demonstrated that applying small time pads 
accurately is non-trivial because the timing loop itself may leak 
information. We developed a method by which small time pads 
can be applied securely. We hope that our work will motivate 
application developers to leverage some of our techniques to 
protect their applications from a wide variety of timing attacks. 
We also expect that the underlying principles of our solution 
will be useful in future work protecting against other forms of 
side channel attacks. 